Abstract

We consider killed Markov decision processes for countable models on a finite time
interval. The existence of a uniform $\varepsilon$-optimal policy is proven. We show the
correctness of the fundamental equation. The optimal control problem is reduced to a
similar problem for the derived model. We obtain an optimality equation and a method for
the construction of simple optimal policies. The sufficiency of simple policies for
countable models is proven. We show the correctness of the Markovian property.
Additionally, a dynamic programming principle is considered.

1. Introduction. Markov decision processes arise in various areas of economics,
in particular in the economic planning of an individual business, an economic sector, or
an entire economy. At the beginning of each period we can build a plan for the next period,
knowing the last achieved state. The development of the system can be described
mathematically as a deterministic process if we assume that the position of the system at
the end of each period is uniquely defined by the state at the end of the previous period
and by the plan for this period.

It is necessary to consider the influence of such factors as meteorological conditions,
demographic transition, demand fluctuations, the imperfection of the compound production
[...] account these factors: if we know the state at the beginning of the period and the plan,
we can only calculate the probability distribution for the next period. Therefore, leaving
aside the states of the system in past periods, we come to the idea of a Markov decision
process ("the future depends not on the past, but only on the present").

Markov decision processes are described in detail in [1]: the definition of a Markov
decision process is given, the concept of the "model" $Z_\mu$ is presented, the definition
of a policy $\pi$ is given, the assessment $\omega(\pi)$ of a policy and the assessment
$\nu$ of the process $Z_\mu$ are defined, the existence of a uniform $\varepsilon$-optimal
policy is proved, the optimality equation and a method for constructing simple optimal
policies are presented, the sufficiency of simple policies for countable models is proved,
the correctness of the Markovian property is shown, and the dynamic programming principle
is considered.

In [1] the model does not take into account one risk factor, namely the probability of
bankruptcy at a given moment of time. This leads us to the idea of a killed Markov
decision process, in which the business can crash with some nonzero probability at
every moment of time except the initial one.

The concept of the killed Markov decision process brings us closer to real economic
systems, which are rarely free of risk.

1 Revised and corrected version of the paper published in Transactions of NAS of Azerbaijan, (2010), 
vol. XXX, No 4, pp. 141-152. 



2. Killed Markov decision process. Let $X_t$ ($t = m, \dots, n$) and $A_t$ ($t = m+1, \dots, n$)
be countable or finite sets, at least one of which is countable. To every $a \in A_t$
is assigned a probability distribution $p(\cdot \mid a) = P(x_t = x \mid a_t = a, x_{t-1})$ on $X_t$.

Definition 1. The function $p$ which defines the law of transition from $A_t$ to $X_t$ is
called the transition function.

Definition 2. The point $x^* = x_m \in X_t$ is called a killed state, and $p(x^* \mid a)$ the
probability of kill, if $P(x_{t+1} = x^* \mid a_t = a) = P(x_{t+1} = x_m \mid a_t = a) = p(x^* \mid a)$, $x_m \in X_m$.

Remark 1. In other words, the system moves into the initial (home) state when it hits a
killed state (the process is killed).

From the definition of the killed state it follows that

$$\forall a \in A_t \ \exists x^* \in X_t : \quad p(x^* \mid a) = 1 - \sum_{x \in X_t \setminus \{x^*\}} p(x \mid a) > 0.$$

Definition 3 (Killed Markov decision process). A killed Markov decision process on a time
interval $[m, n]$ is defined through the following objects:

1. Sets $X_m, \dots, X_n$ (spaces of states);

2. Sets $A_{m+1}, \dots, A_n$ (spaces of actions);

3. The projection mapping $j : A \to X$, where $A = \bigcup_{t=m+1}^{n} A_t$, $X = \bigcup_{t=m}^{n} X_t$: $j(A_t) = X_{t-1} \setminus \{x^*\}$, $x^* \in X_{t-1}$ ($t = m+2, \dots, n$) and $j(A_{m+1}) = X_m$;

4. The probability distribution $p(\cdot \mid a) = P(x_t = x \mid a_t = a, x_{t-1})$ on $X_t$ with killed states: $P(x_{t+1} = x^* \mid a_t = a) = P(x_{t+1} = x_m \mid a_t = a) = p(x^* \mid a) > 0$;

5. The function $q$ on $A$ (reward function);

6. The function $r$ on $X_n$ (terminal reward);

7. The function $c$ (crash function), defined on the killed states: $c(x^*) = -\sum_{i=m+1}^{t} \max_{a_i \in A_i} q(a_i)$, $x^* \in X_t$, $t = m+1, \dots, n$ (the function $c$ ensures a total bankruptcy: a total loss of the accumulated capital or more);

8. The initial distribution $\mu$ on $X_m$.

A stochastic process defined through (1-8) is called the killed Markov decision process
or the model, and it is denoted by $Z^*_\mu$. If the initial distribution $\mu$ is concentrated at
the point $x$, we shall write $Z^*_x$.
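
The crash function of item 7 can be tabulated stage by stage as a running sum of the largest one-step rewards. The following is a minimal sketch under assumptions that are not in the original text: the action sets are finite (so the maximum exists), and the rewards are encoded as a list of dictionaries, one per stage, with the hypothetical helper name `crash_values` and hypothetical action labels.

```python
def crash_values(q_by_stage):
    """Return [c_{m+1}, ..., c_n] for the killed states of X_{m+1}, ..., X_n.

    q_by_stage[i] maps every action of stage m+1+i to its reward q(a).
    c_t = -(sum over i <= t of max_a q(a)), as in item 7 of Definition 3.
    """
    total = 0.0
    values = []
    for rewards in q_by_stage:
        total += max(rewards.values())  # best reward attainable at this stage
        values.append(-total)           # crash wipes out at least all of it
    return values


# Two stages with hypothetical actions "a", "b" and "c".
print(crash_values([{"a": 1.0, "b": 3.0}, {"c": 2.0}]))
```

With this encoding the crash value at stage $t$ is never larger than minus the capital that any policy could have accumulated up to that stage, which is the "total loss or more" reading of item 7.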

Definition 4. The trajectory $l = x_m a_{m+1} x_{m+1} \dots a_n x_n$ is called a way. The set of all
ways is denoted by $L = X \times (X \times A)^n$.

Our goal is to find a decision method which maximizes the mathematical expectation
of the assessment of the way $l$:

$$I(l, x^*) = \sum_{t=m+1}^{n} \left[ q(a_t) + c(x^*_t) \right] + r(x_n), \tag{2.1}$$

where:

$x^* = (x^*_{m+1}, \dots, x^*_n)$ is the vector of killed states;
$l = x_m a_{m+1} \dots a_n x_n$ is the way.

By a decision method we mean a policy.



3. Policies.

Definition 5. Let $A(x) \subset A$ be the set of all actions available at the state $x \in X$.
A mapping $\varphi : X \to A(x)$ is called a simple policy if $\varphi(x_{t-1}) = a_t$ for every $x_t$
which is not a killed state and has the probability distribution $p(\cdot \mid a_t)$ ($m < t \le n$),
and $x_m$ has the initial distribution $\mu$.

Remark 2. When we use the simple policy $\varphi(x)$ we get the way $l = x_m a_{m+1} \dots a_n x_n$.

Definition 6. The mapping $\pi : H \to \pi(\cdot \mid h \in H)$ is called a killed policy, where
$\pi(\cdot \mid h)$, $h \in H$, is a probability distribution on $A(x_{t-1})$ and $H = X \times (A \times X)^{t-1}$
is the space of histories up to epoch $m \le t-1 < n$ ($h \in H \iff h = x_m a_{m+1} \dots a_{t-1} x_{t-1}$).

Remark 3. Obviously, $x_{t-1} \ne x^*$.

Definition 7. A killed policy $\pi(\cdot \mid h)$ is called a Markov policy if $\pi(\cdot \mid h) = \pi(\cdot \mid x_{t-1})$.

The notions introduced below are not well defined without the following assumption:

Assumption 1. The reward function $q$ and the terminal reward function $r$ have suprema:
$\exists \sup_{a \in A} q(a)$ and $\exists \sup_{x \in X_n} r(x)$.

Definition 8. Let $p(\cdot \mid a)$ be the transition function and let $\pi(\cdot \mid h)$ be a policy. To every
initial distribution $\mu$ there corresponds a probability distribution $P^*$ on the space $L$,
given by

$$P^*(l, x^*) = P^*(x_m a_{m+1} \dots a_n x_n,\ x^*_{m+1}, \dots, x^*_n) = \mu(x_m)\,\pi(a_{m+1} \mid x_m)\,p(x_{m+1} \mid a_{m+1})\,p(x^*_{m+1} \mid a_{m+1}) \cdots \pi(a_n \mid h_{n-1})\,p(x_n \mid a_n)\,p(x^*_n \mid a_n). \tag{3.1}$$

Remark 4. Once the measure $P^*$ is defined, the way $l$ can be interpreted as a
stochastic process. This process is called a Markov process if the policy $\pi$ is
a Markov policy.

For every function $\xi$ on the space $L$ the mathematical expectation of $\xi$ is given by

$$E^*(\xi) = \sum_{l \in L} \xi(l)\, P^*(l, x^*). \tag{3.2}$$

The assessment (2.1) of the way $l$ is an example of such a function. We denote its
expectation by $\omega$:

$$\omega = E^* I(l, x^*) = E^*\left[ \sum_{t=m+1}^{n} \left[ q(a_t) + c(x^*_t) \right] + r(x_n) \right]. \tag{3.3}$$

Definition 9 (Assessment of policy). The value $\omega$ from (3.3) is called the assessment
of the policy $\pi$; it is a function of the variable $\pi$ ($\omega = \omega(\pi)$) for the killed Markov
decision process $Z^*_\mu$.

The goal of the research is the maximization of the function $\omega(\pi)$.

Definition 10 (Assessment of process). $\nu = \sup_{\pi} \omega(\pi)$ is called the assessment of the
killed Markov decision process $Z^*_\mu$ or the assessment of the initial distribution $\mu$.

Remark 5. $\nu(x^*) = c(x^*)$.

Definition 11 ($\varepsilon$-optimal policy). A killed policy $\pi$ is called $\varepsilon$-optimal for $Z^*_\mu$ if
$\forall \varepsilon > 0$: $\omega(\mu, \pi) \ge \nu(\mu) - \varepsilon$.

Definition 12 (Uniform $\varepsilon$-optimal policy). A killed policy is called uniform $\varepsilon$-optimal,
or $\varepsilon$-optimal for the process $Z^*$, if it is $\varepsilon$-optimal for $Z^*_\mu$ for every initial distribution $\mu$.



4. Existence of uniform $\varepsilon$-optimal policy. Let $\pi_x$ be an $\varepsilon$-optimal policy for the
process $Z^*_x$. Its existence follows from the definition of the supremum.

We want to build a killed policy $\bar\pi$ which is $\varepsilon$-optimal for the model $Z^*$ by using the
family of killed policies $\pi_x$.

It is natural to use the policy $\pi_x$ when $x$ is the starting point. Formally,

$$\bar\pi(\cdot \mid h) = \pi_{x(h)}(\cdot \mid h), \tag{4.1}$$

where $x(h)$ is the initial state of the history $h$. It is clear that formula (4.1) defines a
policy $\bar\pi$ and that this policy is $\varepsilon$-optimal. This means that $\forall \varepsilon > 0$:
$\omega(x, \bar\pi) = \omega(x, \pi_x) \ge \nu(x) - \varepsilon$, $\forall x \in X_m$.

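Proposition 1 (Existence of the uniform $\varepsilon$-optimal killed policy). Every killed policy $\bar\pi$
from (4.1) which is $\varepsilon$-optimal, i.e.

$$\omega(x, \bar\pi) \ge \nu(x) - \varepsilon, \quad (x \in X_m),\ \forall \varepsilon > 0,$$

is uniform $\varepsilon$-optimal. This means that $\forall \mu, \forall \varepsilon > 0$: $\sup_{\pi} \omega(\mu, \pi) \le \omega(\mu, \bar\pi) + \varepsilon$.

Proof. From (3.1)-(3.3) it follows that $\forall \pi$:

$$\omega(\mu, \pi) = \sum_{l \in L} I(l, x^*)\, P^*(l, x^*) = \sum_{x \in X_m} \mu(x)\, \omega(x, \pi). \tag{4.2}$$

Since $\omega(x, \pi) \le \nu(x)$ for every $\pi$ and $\omega(x, \bar\pi) \ge \nu(x) - \varepsilon$, it follows from (4.2) that

$$\sup_{\pi} \omega(\mu, \pi) \le \sum_{x \in X_m} \mu(x)\, \nu(x), \tag{4.3}$$

$$\omega(\mu, \bar\pi) \ge \sum_{x \in X_m} \mu(x)\, \nu(x) - \varepsilon. \tag{4.4}$$

Since $\varepsilon > 0$ is arbitrary, we get from (4.3) and (4.4)

$$\sup_{\pi} \omega(\mu, \pi) = \sum_{x \in X_m} \mu(x)\, \nu(x) \le \omega(\mu, \bar\pi) + \varepsilon. \tag{4.5}$$

So the policy $\bar\pi$ is uniform $\varepsilon$-optimal. Proposition 1 is proved.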
Corollary 1. For all initial distributions $\mu$:

$$\nu(\mu) = \mu\nu. \tag{4.6}$$

Proof. It follows from $\nu(\mu) = \sum_{x \in X_m} \mu(x)\, \nu(x) = \mu\nu$.

Remark 6. Formulas (4.2) and (4.6) allow us to reduce the analysis of the processes $Z^*_\mu$
for all $\mu$ to the analysis of the processes $Z^*_x$, $\forall x \in X_m$.

The policy $\bar\pi$ is built from the family $\pi_x$ ($x \in X_m$) and has the following property (1):

For every initial state $x \in X_m$, the probability distributions on the space $L$
which are assigned by (3.1) to the policies $\bar\pi$ and $\pi_x$ are equal.

Definition 13. If $\bar\pi$ satisfies property (1), then $\bar\pi$ is called the combination of the policies $\pi_x$.



5. Derived model and fundamental equation. The decision process consists of a number
of consecutive steps. The first step is the choice of a probability distribution on $A_{m+1}$
which depends on the initial state. Once this choice is made, every initial distribution $\mu$ on $X_m$
corresponds to a probability distribution $\mu'$ on $X_{m+1}$. We now consider $\mu'$ as the initial
distribution at the moment of time $m + 1$.

As a result, we divide our maximization problem into two problems:

1. Choose the optimal policy for the subsequent moments of time for every initial distribution
on $X_{m+1}$;

2. Choose the first step so as to maximize the reward together with the assessment of the
optimal policy at the subsequent moments of time for the initial distribution $\mu'$.

Definition 14 (Derived model). The model which is built from the model $Z^*$ by deleting $X_m$
and $A_{m+1}$ is called the derived model and is denoted by $Z^{*\prime}$.

Proposition 2 (Fundamental equation).

$$\omega(x, \pi) = \sum_{a \in A(x)} \pi(a \mid x) \left( q(a) + \omega'(p_a, \pi_a) \right), \tag{5.1}$$

where $p_a = p(\cdot \mid a)$, $\pi_a(\cdot \mid h) = \pi(\cdot \mid yah)$,
$a \in A_{m+1}$, $y = j(a)$, and $h$ is a history in the model $Z^{*\prime}$.

The equation (5.1) is called fundamental. It expresses the assessment of an arbitrary
policy $\pi$ in the model $Z^*$ in terms of the assessment $\omega'$ of some policies in the model
$Z^{*\prime}$.

Proof. According to (4.2) we get

$$\omega'(p_a, \pi_a) = \sum_{y \in X_{m+1}} p(y \mid a)\, \omega'(y, \pi_a). \tag{5.2}$$

Let us consider the spaces of ways $L$ and $L'$ in the models $Z^*$ and $Z^{*\prime}$. Let $P^*$ be the
probability distribution on $L$ corresponding to the initial state $x$ and the policy $\pi$, and let
$P^{*\prime}$ be the probability distribution on $L'$ corresponding to the initial distribution $p_a$ and
the policy $\pi_a$.

According to (2.1) and (3.1), $\forall l' \in L'$ we get

$$I(xal', x^*) = q(a) + I(l', x^*_{-1}), \tag{5.3}$$

$$P^*(xal', x^*) = \pi(a \mid x)\, P^{*\prime}(l', x^*_{-1}), \tag{5.4}$$

where $a \in A(x)$, $x^*_{-1} = (x^*_{m+2}, \dots, x^*_n)$, $(x^*_{m+1}, x^*_{-1}) = x^*$.
In the notation of (3.2) and (3.3) we get

$$\omega(x, \pi) = \sum_{L} P^*(l, x^*)\, I(l, x^*), \tag{5.5}$$

$$\omega'(p_a, \pi_a) = \sum_{L'} P^{*\prime}(l', x^*_{-1})\, I(l', x^*_{-1}). \tag{5.6}$$

The measure $P^*(l, x^*)$ is nonzero only for ways which have the starting point $x$, i.e.,
for ways of the form $xal'$. Therefore, substituting into (5.5) the expression for $I(l, x^*)$ from (5.3)
and the expression for $P^*(l, x^*)$ from (5.4), and using (5.6), we get the fundamental
equation (5.1). Proposition 2 is proved.

Remark 7. The fundamental equation is correct even without Assumption 1. 



6. Reducing the problem of the optimal decision to the analogous problem for the
derived model. From the fundamental equation (5.1) the following inequality follows:

$$\omega(x, \pi) \le \sup_{a \in A(x)} \left[ q(a) + \omega'(p_a, \pi_a) \right] \le \sup_{a \in A(x)} \left[ q(a) + \nu'(p_a) \right] \tag{6.1}$$

$\forall x \in X_m$ and for every $\pi$ (here $\nu'$ is the assessment of the model $Z^{*\prime}$).

We denote $u(a) = q(a) + \nu'(p_a)$ ($a \in A_{m+1}$) and call this value the assessment of the
action $a$.

According to (4.3) and $\nu(x^*) = c(x^*)$ we get $u = U\nu'$, where the operator $U$ transforms
functions on the non-killed states of $X$ into functions on $A$ and is given by

$$Uf(a) = q(a) + \sum_{y} p(y \mid a)\, f(y) + \sum_{y^*} p(y^* \mid a)\, c(y^*), \tag{6.2}$$

where $y$ and $y^*$ run over the non-killed states and the killed states, respectively.
Let the operator $V$ transform functions on $A$ into functions on the non-killed and
non-terminal states of $X$ according to

$$Vg(x) = \sup_{a \in A(x)} g(a). \tag{6.3}$$

Let us rewrite inequality (6.1) using the operator $V$:

$$\omega(x, \pi) \le Vu(x).$$

Taking the supremum over $\pi$ on both sides of $\omega(x, \pi) \le Vu(x)$, we get

$$\nu \le Vu. \tag{6.4}$$

Remark 8. Later we give conditions which ensure equality in (6.4).

Definition 15 (Product of policies). Let $\pi'$ be a killed policy in the model $Z^{*\prime}$ and let to
each $x \in X_m$ be assigned a probability distribution $\gamma(\cdot \mid x)$ on $A_{m+1}$ which is concentrated on
$A(x)$. If we choose an action $a$ at the first step according to $\gamma$ and use the killed
policy $\pi'$ at all other steps, then we get a killed policy $\pi$ in the model $Z^*$. This policy is called
the product of the policies $\gamma$ and $\pi'$ and is denoted by $\gamma\pi'$. It has the expression

$$\gamma\pi'(\cdot \mid h) = \begin{cases} \gamma(\cdot \mid x) & \text{for } h = x \in X_m, \\ \pi'(\cdot \mid h') & \text{for } h = xah'. \end{cases}$$

Proposition 3. Let $\pi = \gamma\pi'$ be a product of the killed policies $\gamma$ and $\pi'$. If $\pi'$ is uniform
$\varepsilon'$-optimal for the model $Z^{*\prime}$ then:

$$\nu = Vu. \tag{6.4'}$$

Proof. The fundamental equation (5.1) for a product of policies has the following
expression:

$$\omega(x, \gamma\pi') = \sum_{a \in A(x)} \gamma(a \mid x) \left( q(a) + \omega'(p_a, \pi') \right). \tag{6.5}$$

Since $\pi'$ is $\varepsilon'$-optimal (such a policy exists $\forall \varepsilon' > 0$ according to Proposition 1), we get
$\omega'(p_a, \pi') \ge \nu'(p_a) - \varepsilon'$, and by the definition of $u$ equation (6.5) turns into

$$\omega(x, \gamma\pi') \ge \sum_{a \in A(x)} \gamma(a \mid x)\, u(a) - \varepsilon'.$$

Let us consider the set

$$A_\chi(x) = \{ a : a \in A(x),\ u(a) \ge Vu(x) - \chi \} \quad (x \in X_m).$$

$A_\chi(x)$ is nonempty for all $\chi > 0$. Let $\gamma(\cdot \mid x)$ be a probability distribution on $A(x)$ which
is concentrated on $A_\chi(x)$.

Then

$$\sum_{a \in A(x)} \gamma(a \mid x)\, u(a) \ge Vu(x) - \chi.$$

Choosing $\varepsilon'$ and $\chi$ so that $\varepsilon' + \chi \le \varepsilon$, we get

$$\omega(x, \pi) \ge Vu(x) - \varepsilon, \quad (x \in X_m). \tag{6.6}$$

According to (6.4) and (6.6), Proposition 3 is proved.
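
The set $A_\chi(x)$ used in the proof can be computed directly when the action set is finite. The sketch below is only an illustration under that finiteness assumption (so that the supremum in $Vu(x)$ is a maximum); the function name `chi_optimal_actions` and the action labels are hypothetical.

```python
def chi_optimal_actions(u, available, chi):
    """Return the actions a in A(x) with u(a) >= Vu(x) - chi.

    u         -- dict mapping an action to its assessment u(a)
    available -- list of actions A(x) available at the state x
    chi       -- nonnegative tolerance
    """
    vu = max(u[a] for a in available)  # Vu(x) for a finite action set
    return [a for a in available if u[a] >= vu - chi]


u = {"a": 1.0, "b": 0.8, "c": 0.2}
print(chi_optimal_actions(u, ["a", "b", "c"], 0.25))
```

Any distribution $\gamma(\cdot \mid x)$ supported on the returned list satisfies the inequality $\sum_{a} \gamma(a \mid x)\, u(a) \ge Vu(x) - \chi$ used above.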

Corollary 1. The assessment $\nu$ of the model $Z^*$ is expressed in terms of the assessment
$\nu'$ of the model $Z^{*\prime}$ in the following way:

$$\nu = Vu, \quad u = U\nu', \tag{6.7}$$

where the operators $U$ and $V$ are defined in (6.2) and (6.3).
Corollary 2. For every $\chi > 0$ there exists $\psi(x) : X_m \to A_{m+1}(x)$ such that

$$u(\psi(x)) \ge \nu(x) - \chi. \tag{6.8}$$

Here $\gamma(\cdot \mid x)$ can be the distribution concentrated at the single point $\psi(x) \in A_\chi(x)$.

Corollary 3. Let $\varepsilon'$ and $\chi$ be arbitrary nonnegative numbers. If $\pi'$ is uniform
$\varepsilon'$-optimal for the model $Z^{*\prime}$ and $\psi$ is as in Corollary 2, then the killed policy $\psi\pi'$ is
uniform $(\varepsilon' + \chi)$-optimal for the model $Z^*$.

7. Optimality equation. Method for the construction of simple optimal policies.

Let us assume that $m = 0$ in our model $Z^*$. Consider the models $Z^*_0, Z^*_1, \dots, Z^*_n$, where
$Z^* = Z^*_0$ and $Z^*_t$ is the derived model of $Z^*_{t-1}$. We denote the assessments $\nu$ and $u$ of the
model $Z^*_t$ by $\nu_t$ and $u_{t+1}$ ($\nu_t$ on $X_t$, $u_{t+1}$ on $A_{t+1}$). The reward function $q$ and the
transition function $p$ are denoted by $q_t$ and $p_t$.

According to the results of Section 6 we get

$$\nu_{t-1} = V_t u_t, \quad u_t = U_t \nu_t \quad (1 \le t \le n), \tag{7.1}$$

where

$$U_t f(a) = q_t(a) + \sum_{y \in X_t} p_t(y \mid a)\, f(y) + p_t(y^* \mid a)\, c(y^*), \quad (a \in A_t,\ y^* \in X_t),$$

$$V_t g(x) = \sup_{a \in A(x)} g(a), \quad (x \in X_{t-1}),$$

and $\nu_n = r$.

Equations (7.1) are called the optimality equations. Let $T_t = V_t U_t$; then the
optimality equations become

$$\nu_{t-1} = T_t \nu_t. \tag{7.1'}$$

From (7.1), (7.1') and the condition $\nu_n = r$ we calculate $\nu_n, \nu_{n-1}, \dots, \nu_0$. Then we choose
the actions $\psi_t(x) : X_{t-1} \to A_t(x)$ for which

$$u_t(\psi_t) \ge \nu_{t-1} - \chi_t \tag{7.2}$$

$\forall t = 1, 2, \dots, n$ and for given nonnegative $\chi_1, \chi_2, \dots, \chi_n$.

According to Corollary 3 of Proposition 3, the simple policy $\psi = \psi_1 \psi_2 \dots \psi_n$ is uniform
$\varepsilon$-optimal for the model $Z^* = Z^*_0$ with $\varepsilon = \sum_{t=1}^{n} \chi_t$. The inequality (7.2) can be rewritten as

$$T_{\psi_t} \nu_t \ge \nu_{t-1} - \chi_t, \tag{7.2'}$$

where the operator $T_{\psi_t}$ transforms functions on $X_t$ into functions on $X_{t-1}$ in the following
way:

$$T_{\psi_t} f(x) = q_t[\psi_t(x)] + \sum_{y \in X_t} p_t(y \mid \psi_t(x))\, f(y) + p_t(y^* \mid \psi_t(x))\, c(y^*). \tag{7.3}$$

Proposition 4. Let $\pi$ be an arbitrary killed policy in the derived model $Z^*_k$ ($k = 1, 2, \dots, n$)
and let $\psi_t : X_{t-1} \to A_t(x)$ ($t = 1, 2, \dots, k$) be arbitrary too. Then

$$\omega_0(x, \psi_1 \psi_2 \dots \psi_k \pi) = T_{\psi_1} T_{\psi_2} \dots T_{\psi_k}\, \omega_k(x, \pi). \tag{7.4}$$

Proof. It follows from the fundamental equation (5.1), formulas (5.2) and (7.3), and
mathematical induction.

Remark 9. It follows from (7.4) that the result does not change if the decision process is
terminated at the moment of time $k$ and the assessment of the policy $\pi$ is taken as the terminal reward.

Remark 10. If we can choose $\psi_t$ with $\chi_t = 0$ in (7.2) $\forall t = 1, \dots, n$, then the simple policy
$\psi = \psi_1 \dots \psi_n$ is called uniform optimal.
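
The construction of this section can be traced on a small finite example. The following is a minimal sketch, not the authors' algorithm as stated: it assumes finite state and action sets (so each supremum is attained and $\chi_t = 0$ is possible), a single killed state per stage whose probability is the mass missing from the listed transitions, and a hypothetical dictionary encoding of the model with made-up action names `safe` and `risky`.

```python
def backward_induction(stages, r):
    """Solve nu_{t-1} = V_t U_t nu_t backwards from nu_n = r.

    stages[t-1] describes step t = 1..n as a dict with keys
      'actions': {x: [a, ...]}  actions available in non-killed x of X_{t-1}
      'q':       {a: reward}
      'p':       {a: {y: prob}} transitions to non-killed y of X_t
      'c':       crash value c(y*) of the killed state of X_t
    The kill probability p(y*|a) is the missing mass 1 - sum_y p(y|a).
    Returns ([nu_0, ..., nu_n], [psi_1, ..., psi_n]).
    """
    nu = dict(r)
    values, policy = [nu], []
    for stage in reversed(stages):
        u = {}
        for a, reward in stage["q"].items():
            trans = stage["p"][a]
            kill = 1.0 - sum(trans.values())
            # u_t = U_t nu_t, formula for U_t above
            u[a] = reward + sum(pr * nu[y] for y, pr in trans.items()) + kill * stage["c"]
        # psi_t attains the maximum in V_t, so (7.2) holds with chi_t = 0
        psi = {x: max(acts, key=u.__getitem__) for x, acts in stage["actions"].items()}
        nu = {x: u[a] for x, a in psi.items()}
        values.insert(0, nu)
        policy.insert(0, psi)
    return values, policy


# Hypothetical two-step model with one non-killed state "s".
step = {
    "actions": {"s": ["safe", "risky"]},
    "q": {"safe": 1.0, "risky": 3.0},
    "p": {"safe": {"s": 0.9}, "risky": {"s": 0.5}},
}
stages = [dict(step, c=-3.0), dict(step, c=-6.0)]
values, policy = backward_induction(stages, {"s": 0.0})
print(values, policy)
```

In this toy model the crash values $-3$ and $-6$ follow item 7 of Definition 3 with a maximal one-step reward of $3$. The growing crash penalty makes the cautious action preferable at the last step, while the larger immediate reward wins at the first step.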

8. The sufficiency of the simple policies for countable models. The question
arises: do we lose something by using only simple policies? The previous result does not
answer it; it only shows that the losses can be made arbitrarily small.

Theorem 1 (Sufficiency of the simple policies). Let $\mu$ be a fixed initial distribution and let
$\pi$ be an arbitrary killed policy. Then there exists a simple policy $\psi$ such that

$$\omega(\mu, \pi) \le \omega(\mu, \psi). \tag{8.1}$$

Proof. It follows from Proposition 5 and Proposition 6.

Proposition 5. For every $\mu$ and every killed policy $\pi$ there exists a Markov policy $\theta$ such
that

$$\omega(\mu, \theta) = \omega(\mu, \pi). \tag{8.2}$$

These two policies are called equivalent.
Proposition 6. For every Markov policy $\theta$ there exists a simple policy $\varphi$ such that

$$\omega(\mu, \varphi) \ge \omega(\mu, \theta). \tag{8.3}$$

We say that $\varphi$ dominates $\theta$ uniformly.
Proof (Proposition 5). Let $\theta$ be the Markov policy defined by

$$\theta(a \mid x) = P^*\{a_t = a \mid x_{t-1} = x\} = \frac{P^*\{x_{t-1} a_t = xa\}}{P^*\{x_{t-1} = x\}} \tag{8.4}$$

$(a \in A_t,\ x \in X_{t-1},\ m + 1 \le t \le n)$,

where $P^*$ is the probability measure on the space of ways $L$ which is assigned to the initial
distribution $\mu$ and to the policy $\pi$.

Remark 11. The expression on the right side of (8.4) makes no sense for $P^*\{x_{t-1} = x\} = 0$.
So, for such $x$ (in particular for killed states) we choose an arbitrary distribution
on $A(x)$ as $\theta(\cdot \mid x)$.

Let $Q^*$ denote the probability distribution on the space $L$ which is assigned to the initial
distribution $\mu$ and to the killed Markov policy $\theta$.

The distribution $Q^*$ does not coincide with $P^*$ in general, but to prove (8.2) it is enough
that each of $x_m, a_{m+1}, \dots, a_n, x_n$ and $x^*_{m+1}, x^*_{m+2}, \dots, x^*_n$ has the same
probability distribution under the measures $P^*$ and $Q^*$.

Indeed, the following holds:

$$\omega(\mu, \pi) = \sum_{t=m+1}^{n} P^* q(a_t) + \sum_{t=m+1}^{n} P^* c(x^*_t) + P^* r(x_n),$$

$$\omega(\mu, \theta) = \sum_{t=m+1}^{n} Q^* q(a_t) + \sum_{t=m+1}^{n} Q^* c(x^*_t) + Q^* r(x_n).$$

We use mathematical induction to prove that these distributions coincide.

Basis of induction: the distributions of $x_m$ coincide because $P^* = Q^* = \mu$ on $X_m$.

Induction hypothesis: let the distributions of $x_{t-1}$ coincide. Let us check the claim for $a_t$.

Since $\theta$ is a killed Markov policy,

$$Q^*\{x_{t-1} a_t = xa\} = Q^*\{x_{t-1} = x\}\, \theta(a \mid x), \quad (a \in A_t,\ x \in X_{t-1}). \tag{8.5}$$

Hence, from (8.4) and (8.5) we get

$$P^*\{a_t = a\} = \sum_{x \in X_{t-1}} P^*\{x_{t-1} a_t = xa\} = \sum_{x \in X_{t-1}} P^*\{x_{t-1} = x\}\, \theta(a \mid x) = \sum_{x \in X_{t-1}} Q^*\{x_{t-1} = x\}\, \theta(a \mid x) = \sum_{x \in X_{t-1}} Q^*\{x_{t-1} a_t = xa\} = Q^*\{a_t = a\}.$$

So the claim holds for $a_t$.

Induction hypothesis: let the distributions of $a_t$ coincide. Let us show the claim for $x_t$.

From the definition of the transition function we get

$$P^*\{a_t x_t = ax\} = P^*\{a_t = a\}\, p(x \mid a), \tag{8.6}$$

$$Q^*\{a_t x_t = ax\} = Q^*\{a_t = a\}\, p(x \mid a). \tag{8.7}$$

From (8.6) and (8.7) it follows that

$$P^*\{x_t = x\} = \sum_{a \in A_t} P^*\{a_t x_t = ax\} = \sum_{a \in A_t} P^*\{a_t = a\}\, p(x \mid a) = \sum_{a \in A_t} Q^*\{a_t = a\}\, p(x \mid a) = \sum_{a \in A_t} Q^*\{a_t x_t = ax\} = Q^*\{x_t = x\}, \quad (x \in X_t).$$

Proposition 5 is proved.

Proof (Proposition 6). To prove this proposition we need the following lemma.

Lemma 1. Let $f$ be an arbitrary function and let $\nu$ be an arbitrary probability distribution on
a countable space $E$.

If $\nu f < +\infty$ then the set $\Gamma = \{x : f(x) \ge \nu f\}$ has positive measure $\nu$, namely

$$\nu(\Gamma) > 0.$$

(See the proof in [1].)

According to (4.2), condition (8.3) is equivalent to

$$\omega(x, \varphi) \ge \omega(x, \theta), \quad \forall x \in X_m.$$

Let us represent the killed Markov policy $\theta$ as a product of policies $\theta = \gamma\theta'$, where $\gamma$ is
the restriction of $\theta$ to $X_m$ and $\theta'$ is the restriction of $\theta$ to $X_{m+1} \cup X_{m+2} \cup \dots \cup X_n$.
According to the fundamental equation (5.1),

$$\omega(x, \theta) = \gamma_x f,$$

where $\gamma_x(\cdot) = \gamma(\cdot \mid x)$ is the probability distribution on $A(x)$,

and $f(a) = q(a) + \omega'(p_a, \theta')$, $(a \in A_{m+1})$.

By Lemma 1, for $\bar A(x) \subset A(x)$ it follows that $\gamma_x(\bar A(x)) > 0$, where
$\bar A(x) = \{a : f(a) \ge \gamma_x f = \omega(x, \theta)\}$. As a result, $\bar A(x)$ is nonempty. If $\psi(x)$ is an
arbitrary point of $\bar A(x)$ then $f(\psi(x)) \ge \omega(x, \theta)$. But by the fundamental equation (5.1) we
get $f(\psi(x)) = \omega(x, \psi\theta')$ and

$$\omega(x, \psi\theta') \ge \omega(x, \theta).$$

Let us assume that condition (8.3) holds for the derived model $Z^{*\prime}$. Then there exists a simple
policy $\varphi'$ in $Z^{*\prime}$ which uniformly dominates the killed Markov policy $\theta'$. According to the
fundamental equation (5.1) and our assumption we get

$$\omega(x, \psi\varphi') = q(\psi(x)) + \omega'(p_{\psi(x)}, \varphi') \ge q(\psi(x)) + \omega'(p_{\psi(x)}, \theta') = \omega(x, \psi\theta') \ge \omega(x, \theta).$$

In the model $Z^*$ the simple policy $\varphi = \psi\varphi'$ dominates $\theta$ uniformly. Hence (8.3) holds for
the model $Z^*$ too.

Proposition 6 is proved.

9. Markovian property. Let $0 < k < n$, and let us use the killed policy $\rho$ on the interval $[0, k]$
and the killed policy $\pi$ on the interval $[k, n]$. By analogy with Definition 15, we say
that the policy $\rho\pi$ is used.

Proposition 7. Let $L_0$ be the space of ways on the interval $[0, n]$, let $L_k$ be the space of
ways on the interval $[k, n]$, let $P^{*\rho\pi}_x$ be the probability distribution on $L_0$ which is assigned
to the initial state $x$ and to the killed policy $\rho\pi$, and analogously let $P^{*\pi}_y$ be the probability
distribution on $L_k$.

Then for every function $\xi = \xi(x_k a_{k+1} \dots x_n)$ on $L_k$ it holds that

$$E^{*\rho\pi}_x \xi = E^{*\rho}_x \left[ E^{*\pi}_{x_k} \xi \right]. \tag{9.1}$$

Proof. For every way $l = y_0 b_1 \dots b_k y_k b_{k+1} \dots y_n$, according to (3.1),

$$P^{*\rho\pi}_x(y_0 b_1 \dots y_n) = P^{*\rho}_x(c y_k)\, P^{*\pi}_{y_k}(y_k d), \tag{9.2}$$

where $c = y_0 b_1 \dots b_k$, $d = b_{k+1} \dots y_n$. Any function $\xi$ on the space $L_k$ can be regarded
as a function on $L_0$ which does not depend on $x_0 a_1 \dots a_k$. We therefore multiply
both sides of (9.2) by $\xi(y_k d)$ and sum over all ways:

$$E^{*\rho\pi}_x \xi = \sum_{c y_k} P^{*\rho}_x(c y_k) \sum_{d} P^{*\pi}_{y_k}(y_k d)\, \xi(y_k d). \tag{9.3}$$

But $P^{*\pi}_{y_k}(yd) = 0$ for $y \ne y_k$, and it follows that

$$\sum_{d} P^{*\pi}_{y_k}(y_k d)\, \xi(y_k d) = \sum_{yd} P^{*\pi}_{y_k}(yd)\, \xi(yd) = E^{*\pi}_{y_k} \xi =: F(y_k). \tag{9.4}$$

Substituting the expression from (9.4) into (9.3) and using
$\sum_{c y_k} P^{*\rho}_x(c y_k)\, F(y_k) = E^{*\rho}_x F(x_k)$, we get (9.1). Proposition 7 is proved.

Corollary 1 (Markovian property). Let $\nu(y) = P^{*\rho}_\mu\{x_k = y\}$ ($y \in X_k$). Then for every $\mu$ we
have, in particular,

$$E^{*\rho\pi}_\mu \xi(x_k a_{k+1} \dots x_n) = E^{*\pi}_\nu \xi(x_k a_{k+1} \dots x_n). \tag{9.5}$$

It follows from (9.1) and $\sum_{y} \nu(y)\, E^{*\pi}_y \xi = E^{*\pi}_\nu \xi$.

The formula (9.5) shows that the probability distribution of the part of the trajectory
on the interval $[k, n]$ does not depend on the distribution $\mu$ and the policy $\rho$. Namely, the
probabilistic forecast of the "future" ($\xi$) depends not on the "past" ($\mu, \rho$), but only on the
"present" ($\nu$). This is precisely the Markovian property.

Let us use the Markovian property for the assessment of the killed policy $\rho\pi$ on the intervals
$[0, k]$ and $[k, n]$. Instead of $\xi$ we take $\xi = \sum_{t=k+1}^{n} \left[ q(a_t) + c(x^*_t) \right] + r(x_n)$, and substituting
into (9.5) we get

$$\omega(\mu, \rho\pi) = E^{*\rho\pi}_\mu \sum_{t=1}^{k} \left[ q(a_t) + c(x^*_t) \right] + \omega(\nu, \pi) = E^{*\rho}_\mu \sum_{t=1}^{k} \left[ q(a_t) + c(x^*_t) \right] + \omega(\nu, \pi). \tag{9.6}$$

The sum in (9.6) expresses the assessment $\omega(\mu, \rho)$ of the policy $\rho$ for a zero terminal
reward, namely, $\omega(\mu, \rho\pi) = \omega(\mu, \rho) + \omega(\nu, \pi)$.

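There is also another interpretation of (9.6). According to (4.2) and $\nu(y) = P^{*\rho}_\mu\{x_k = y\}$
($y \in X_k$) we get

$$\omega(\nu, \pi) = \sum_{y \in X_k} \nu(y)\, \omega(y, \pi) = E^{*\rho}_\mu\, \omega(x_k, \pi),$$

$$\omega(\mu, \rho\pi) = E^{*\rho}_\mu \left[ \sum_{t=1}^{k} \left[ q(a_t) + c(x^*_t) \right] + \omega(x_k, \pi) \right]. \tag{9.7}$$

Hence, the assessment of the killed policy $\rho\pi$ is equal to the assessment of the killed policy
$\rho$ with the terminal reward $\omega(\cdot, \pi)$ at the moment of time $k$.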

10. Dynamic programming principle. Let $Z^*$ be the model on the interval $[0, n]$ and
let $0 \le s < t \le n$. Let $Z^{*t}_s[f]$ denote the model which is obtained from the model $Z^*$ by
restricting the interval $[0, n]$ to $[s, t]$, with the terminal reward $f$ assigned at the moment
of time $t$. Moreover, denote by $\nu^t_s[f]$ the assessment of the model $Z^{*t}_s$ with the terminal
reward $f$. Obviously, $\nu^t_s[f] = (VU)^{t-s} f = T^{t-s} f$ on $X_s$.
Hence, $\forall t \in [0, n]$ it holds that

$$\nu^n_0[r] = \nu^t_0\left[ \nu^n_t[r] \right] \text{ on } X_0 \quad (r \text{ on } X_n). \tag{10.1}$$
The equation (10.1) is equivalent to the optimality equations (7.1) and the condition
$\nu_n = r$. It is called the dynamic programming principle, and it means that to
optimize the decision on the interval $[0, n]$ with terminal reward $r$ we must first
optimize the decision on the interval $[t, n]$ (with the same terminal reward) and then optimize
the decision on the interval $[0, t]$ with terminal reward $\nu^n_t[r]$.

In particular, according to (9.1), if $\pi''$ is a uniform $\varepsilon$-optimal killed policy
for $Z^{*n}_t$ with terminal reward $r$ and $\pi'$ is a uniform $\varepsilon$-optimal policy for $Z^{*t}_0$ with the
terminal reward $\nu^n_t[r]$, then the killed policy $\pi = \pi''\pi'$ has the assessment $\nu^n_0[r]$ and is
uniform $\varepsilon$-optimal for the model $Z^{*n}_0$ (with terminal reward $r$).
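
The identity (10.1) can be checked numerically on a toy model. The sketch below is an illustration only, under assumptions not made in the text: finite state and action sets, one killed state per stage whose probability is the mass missing from the listed transitions, and a hypothetical tuple encoding `(actions, q, p, c)` of each step handled by the made-up helper `value`.

```python
def value(stages, terminal):
    """Apply T_{s+1} ... T_t to the terminal reward (finite toy model).

    Each stage is a tuple (actions, q, p, c):
      actions -- {x: [a, ...]} actions available in the non-killed state x
      q       -- {a: reward}
      p       -- {a: {y: prob}} transitions to non-killed states
      c       -- crash value of the killed state reached with the missing mass
    """
    v = dict(terminal)
    for actions, q, p, c in reversed(stages):
        u = {}
        for a in q:
            kill = 1.0 - sum(p[a].values())
            u[a] = q[a] + sum(pr * v[y] for y, pr in p[a].items()) + kill * c
        v = {x: max(u[a] for a in acts) for x, acts in actions.items()}
    return v


# Hypothetical three-step model with one non-killed state "s".
actions = {"s": ["safe", "risky"]}
q = {"safe": 1.0, "risky": 3.0}
p = {"safe": {"s": 0.9}, "risky": {"s": 0.5}}
stages = [(actions, q, p, -3.0), (actions, q, p, -6.0), (actions, q, p, -9.0)]
r = {"s": 0.0}

whole = value(stages, r)
for t in (1, 2):
    # optimize on [t, n] first, then on [0, t] with the resulting terminal reward
    split = value(stages[:t], value(stages[t:], r))
    print(t, whole["s"], split["s"])
```

Splitting the horizon at any intermediate $t$ reproduces the value computed over the whole interval, which is the content of (10.1) in this finite setting.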